MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Lu, Pan; Bansal, Hritik; Xia, Tony; Liu, Jiacheng; Li, Chunyuan; Hajishirzi, Hannaneh; Cheng, Hao; Chang, Kai-Wei; Galley, Michel; Gao, Jianfeng

arXiv:2310.02255 (cs)

[Submitted on 3 Oct 2023 (v1), last revised 21 Jan 2024 (this version, v3)]


Abstract:Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging. With MathVista, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models. The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%. Our in-depth analysis reveals that the superiority of GPT-4V is mainly attributed to its enhanced visual perception and mathematical reasoning. However, GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that MathVista will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. We further explore the new ability of self-verification, the application of self-consistency, and the interactive chatbot capabilities of GPT-4V, highlighting its promising potential for future research. The project is available at this https URL.
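
The abstract reports GPT-4V's accuracy directly but gives the Bard and human results only as gaps. As a minimal sketch, assuming both gaps are absolute percentage points rather than relative improvements (the abstract does not say which), the implied scores can be recovered with simple arithmetic. The function and constant names below are illustrative and not taken from the paper.

```python
# Figures quoted in the abstract, in percent.
GPT4V_ACCURACY = 49.9
LEAD_OVER_BARD = 15.1  # assumption: absolute percentage points
GAP_TO_HUMAN = 10.4    # assumption: absolute percentage points


def implied_scores(best, lead_over_second, gap_to_human):
    """Return the (second_best, human) accuracies implied by the reported gaps."""
    second_best = round(best - lead_over_second, 1)
    human = round(best + gap_to_human, 1)
    return second_best, human


bard_accuracy, human_accuracy = implied_scores(
    GPT4V_ACCURACY, LEAD_OVER_BARD, GAP_TO_HUMAN
)
print(f"Implied Bard accuracy: {bard_accuracy}%")
print(f"Implied human accuracy: {human_accuracy}%")
```

Under that reading, Bard would sit at about 34.8% and human performance at about 60.3%. If the gaps were meant as relative differences, these values would not hold.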

Comments:	116 pages, 120 figures. Accepted to ICLR 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2310.02255 [cs.CV]
	(or arXiv:2310.02255v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2310.02255

Submission history

From: Pan Lu
[v1] Tue, 3 Oct 2023 17:57:24 UTC (12,562 KB)
[v2] Wed, 25 Oct 2023 20:22:24 UTC (21,304 KB)
[v3] Sun, 21 Jan 2024 03:47:06 UTC (21,346 KB)
